Abstract

Traditional checkpointing and rollback recovery techniques for
parallel systems have typically assumed the communication pattern
is specified by program behavior. In this paper we exploit the
property that the communication pattern can often be changed at
run time without affecting program correctness. A scheduling
algorithm for message processing and its implementation for
reducing rollback propagation are described. The algorithm
incorporates a user-transparent prioritized scheme based upon the
run-time communication and checkpointing history. Communication
trace-driven simulation for several parallel programs written in
the Chare Kernel language demonstrates that the probability of
rollback propagation can be reduced at the cost of slight
additional performance degradation.


1  Introduction 

Numerous checkpointing and rollback recovery techniques have been
proposed in the literature for parallel systems. Rollback
propagation and the associated domino effect [1] have been the
primary issue of concern in many of these techniques.
Checkpointing for parallel and distributed systems can be
classified into three primary categories. Coordinated
checkpointing schemes synchronize computation with checkpointing
by coordinating processors during a checkpointing session in
order to maintain a consistent set of checkpoints [2-4]. Rollback
propagation is avoided at the cost of potentially significant
performance degradation during normal execution.
Loosely-synchronized checkpointing schemes [5,6] reduce the
overhead for coordination by taking advantage of the
loosely-synchronized checkpointing clocks and by bounding the
message transmission delay. Independent checkpointing schemes
replace the checkpoint synchronization by dependency tracking and
possibly message logging [7-11] in order to preserve process
autonomy. Rollback propagation in case of a fault is managed by
searching for a consistent system state based on the dependency
information. Lower run-time overhead during normal execution is
achieved by allowing slower recovery and maintaining multiple
checkpoints. Our paper considers independent checkpointing
schemes for possibly nondeterministic execution.

Research on rollback recovery in multiple-processor systems has
typically assumed that the communication pattern is determined by
program behavior and is not otherwise controllable. Our approach
is based on the observation that the communication pattern
(message sending and processing) can often be determined by the
run-time support system in a user-transparent way as well as by
program behavior. This observation has been used by others to
reduce cache thrashing by means of array subscript analysis in
nested parallel loop constructs for dynamic thread scheduling
[12]. In a message-passing system, since the order in which the
messages arrive at a processor cannot be assumed, changing the
order of message processing will typically not affect program
correctness. We will show that the probability of rollback
propagation in a message-passing system can often be greatly
reduced by reordering the processing of messages.

Another contribution of this paper is the measurement of actual
rollback propagation for several parallel programs. Analysis of
checkpointing and rollback recovery protocols has typically been
performed by means of theoretical models. We examine the degree
of rollback propagation in parallel programs with independent
checkpointing and three message scheduling algorithms.

The outline of the paper is as follows. Section 2 describes the
system model; the message scheduling algorithm is presented in
Section 3; and Section 4 gives our evaluation, the measurement of
rollback propagation and comparisons. Section 5 discusses the
limitations of our approach.

2 System Model and Checkpoint Consistency

The system model considered in this paper is a message-driven
system consisting of a number of concurrent processes for which
all process communication between processors is through message
passing. Processes on the same processor share a single message
queue and belong to the same fail-stop recovery unit [8,13].
Processes can be dynamically generated and the request for the
creation of a new process is sent out as a job message to some
processor according to the load balancing strategy. When a
processor is ready to process a new message, it can pick any one
in the queue, depending on the scheduling algorithm, and then
invoke or create the appropriate process as requested by the
message. Although this model is usually applied to
distributed-memory multicomputers, recent work on parallel
environments has shown that the message processing model can also
be efficiently used on shared-memory multiprocessors [14].

During normal execution, the state of each processor is
occasionally saved as a checkpoint on stable storage. Let
$CP_{ik}$ denote the $k$th checkpoint of processor $p_i$ with
$k \ge 0$ and $0 \le i \le N-1$, where $N$ is the number of
processors. A checkpoint interval is defined to be the time
between two consecutive checkpoints on the same processor, and
the interval between $CP_{ik}$ and $CP_{i(k+1)}$ is called the
$k$th checkpoint interval. Each message is tagged with the
current checkpoint interval number and the processor number of
the sender. Each processor takes its checkpoint independently and
updates the communication information table, or input table [9],
as follows: if at least one message from the $m$th checkpoint
interval of processor $p_j$ has been processed during the
previous checkpoint interval, the pair $(j, m)$ is added to the
table. A checkpoint space reclamation algorithm [15] can be
periodically invoked by any processor to reduce the space
overhead.
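The tagging and input-table bookkeeping described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the names `Message`, `Processor`, `pending` and `input_table` are hypothetical, and the table is assumed to be keyed by the interval number in which the messages were processed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: int          # processor number of the sender (tag)
    send_interval: int   # sender's checkpoint interval number at send time (tag)
    payload: object = None


@dataclass
class Processor:
    pid: int
    interval: int = 0                                 # current checkpoint interval
    pending: set = field(default_factory=set)         # (j, m) pairs seen in this interval
    input_table: dict = field(default_factory=dict)   # interval k -> set of (j, m)

    def send(self, payload=None):
        # Every outgoing message carries the sender number and interval number.
        return Message(self.pid, self.interval, payload)

    def process(self, msg):
        # Remember which sender interval this processor now depends on.
        self.pending.add((msg.sender, msg.send_interval))

    def take_checkpoint(self):
        # Record the dependencies of the interval that just ended,
        # then start the next checkpoint interval.
        self.input_table[self.interval] = self.pending
        self.pending = set()
        self.interval += 1
```

Because the pairs are kept in a set, several messages from the same sender interval contribute a single entry, which matches the "at least one message" wording of the update rule.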

When processor $p_i$ detects an error, it starts a two-phase
centralized recovery procedure [9]. First, a rollback-initiating
message is sent to every other processor to request the
up-to-date communication information. Each surviving processor
takes a virtual checkpoint upon receiving the rollback-initiating
message so that the communication information during the most
recent checkpoint interval is also collected. After receiving the
responses, $p_i$ constructs the extended checkpoint graph [7] and
executes the rollback propagation algorithm [15] to determine the
local recovery line. A rollback-request message is sent to each
processor, which then rolls back and restarts according to the
local recovery line.
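To make the idea of rollback propagation concrete, the sketch below computes a recovery line by simple fixed-point iteration over the collected input tables. It is a generic simplification written for illustration, not the extended checkpoint graph of [7] or the algorithm of [15]; the function name and argument layout are assumptions. A surviving processor's virtual checkpoint is modeled as checkpoint number `latest[i] + 1`, meaning "no rollback".

```python
def recovery_line(latest, input_tables, faulty):
    """Return restart[i], the checkpoint number each processor restarts from.

    latest[i]       : number of the most recent real checkpoint of processor i
    input_tables[i] : dict mapping interval k -> set of (j, m) pairs, meaning
                      processor i processed, during its interval k, a message
                      sent from interval m of processor j
    faulty          : index of the processor that detected the error
    """
    n = len(latest)
    # Survivors start at their virtual checkpoint; the faulty processor
    # must go back to its most recent real checkpoint.
    restart = [latest[i] + 1 for i in range(n)]
    restart[faulty] = latest[faulty]

    changed = True
    while changed:
        changed = False
        for i in range(n):
            for k, pairs in input_tables[i].items():
                if k >= restart[i]:
                    continue  # interval k of processor i is already undone
                # Interval k is kept, so every message processed in it must
                # still have been sent. A send from interval m of j is undone
                # when m >= restart[j]; that would leave an orphan message.
                if any(m >= restart[j] for (j, m) in pairs):
                    restart[i] = k
                    changed = True
    return restart
```

The restart numbers only decrease and are bounded below by zero, so the loop terminates.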

There are two important situations concerning the consistency
between two checkpoints. In Fig. 1(a), if processors $p_i$ and
$p_j$ restart from $CP_{ik}$ and $CP_{jm}$ respectively, message
m is recorded as “received but not yet sent”. In a general model
without the assumption of deterministic execution [16], message m
becomes an orphan message [6] and results in inconsistency
between $CP_{ik}$ and $CP_{jm}$.

Figure 1: Checkpoint consistency: (a) message received but not
yet sent; (b) message sent but not yet received.


Fig. 1(b) illustrates the second situation. The message m becomes
a lost message [6] according to the system state containing
$CP_{ik}$ and $CP_{jm}$. By defining the state of the channels to
be the set of messages sent but not yet received, it has been
proved [2,5] that checkpoints like $CP_{ik}$ and $CP_{jm}$ can be
considered consistent if the corresponding state of the channels
is also recorded. Koo and Toueg [3] assumed such a state is
recorded at the sender side by some end-to-end transmission
protocol. Another way of recording the channel state is through
message logging. A pessimistic logging protocol [17,18] can
ensure such a state is properly recorded at the receiving end¹.
As a result, we consider the situation in Fig. 1(b) as
consistent.
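The two situations can be told apart mechanically from the interval numbers involved. The following sketch is illustrative only; the function names and the tuple layout are assumptions. A send or receive event is taken to be undone when it happened in an interval numbered at or after the checkpoint its processor restarts from.

```python
def classify_message(send_interval, recv_interval, sender_restart, receiver_restart):
    """Classify one message relative to a pair of restart checkpoints."""
    send_undone = send_interval >= sender_restart
    recv_undone = recv_interval >= receiver_restart
    if send_undone == recv_undone:
        return "unaffected"      # both events kept, or both undone
    # Receive kept but send undone: "received but not yet sent" (orphan).
    # Send kept but receive undone: "sent but not yet received" (lost).
    return "lost" if recv_undone else "orphan"


def checkpoints_consistent(messages, restart):
    """messages: iterable of (sender, send_interval, receiver, recv_interval).

    Lost messages are tolerated, as in the text, on the assumption that the
    channel state is recorded (for example by message logging).
    """
    return all(
        classify_message(si, ri, restart[s], restart[r]) != "orphan"
        for s, si, r, ri in messages
    )
```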


¹The recovery protocol described above can also be modified and
applied to systems with optimistic logging [11].

3 Scheduling Message Processing

3.1  Problem  description 

In communication-induced checkpointing [19-22], a checkpoint is
taken on the sender side whenever communication between two
processors occurs. Because the rollback of a processor does not
affect any other processor, rollback propagation is avoided. Our
message scheduling algorithm is motivated by the above schemes.
The difference is that, instead of inserting the checkpoints
based on a given communication pattern, we control the
communication pattern whenever possible according to the fixed
checkpoint pattern. The receiver of each message tries to delay
the processing of the message until the sender passes its next
checkpoint based on the predetermined checkpoint interval.

For example in Fig. 2(a), message $m_0$ enters the queue of
processor $p_1$ at point A earlier than the message $m_2$ at
point B. By using a simple first-in-first-out (fifo) scheduling
algorithm, $p_1$ will process $m_0$ first when it is ready to
process a new message at point C. However, since the order of
processing these two messages does not affect program correctness
in our model, $m_2$ is a better candidate at point C because its
sender $p_2$ has passed its next checkpoint. If indeed $m_2$ is
first processed and is finished at point D (Fig. 2(b)), the
sender of message $m_0$ will also have passed its next
checkpoint. When all the messages can be processed in this way,
there will be no rollback propagation.

The following situations need to be considered for modifying the
simple scheme described above.

1. If the processing of $m_2$ completes at point D' instead of D,
the sender of $m_0$ has not passed the checkpoint yet. If $p_1$
has to process $m_0$ in order to maintain performance, the
communication pattern will no longer be free of rollback
propagation. However, the delayed processing of message $m_0$
does contribute to reducing the rollback propagation probability
because it reduces the chance that an error in $p_0$ requires the
rollback of $p_1$ because of $m_0$.

2. Consider the case shown in Fig. 3. Suppose $p_1$ is forced to
process $m_0$ at point A. It is clear that any rollback initiated
by $p_0$ between A and $CP_{01}$ will roll back $m_0$ and
therefore $m'_0$. It then becomes irrelevant whether the message
$m'_0$ is processed before $CP_{01}$ or after.

3. Fig. 4 illustrates the situation where the rollback
propagation involves more than two processors. Although $m_1$ is
processed after $p_1$ has passed $CP_{11}$, a rollback of $p_0$
initiated between A and $CP_{01}$ can still propagate the
rollbacks to $p_1$ and $p_2$ through $m_0$ and $m_1$. Such a
situation is similar to the indirect potential recaller (IPR)
relationship described by Kim et al. [23,24].

Figure 2: Checkpointing and message processing: (a) normal fifo
processing; (b) reordered processing for reducing rollback
propagation.

Figure 3: Processing irrelevant messages in terms of rollback
propagation.

3.2  The  scheduling  algorithm 

Based on the above observations, we give the following
definition:

Definition: A message m in the queue of a processor p is safe if
the immediate processing of m by p does not increase the
probability of rollback propagation.


Figure  4:  Rollback  propagation  involving  more  than 
two  processors. 

There are two types of safe messages:

Type 1: when at least one message from the $i$th checkpoint
interval of $p_s$ is processed by the receiver $p_r$, all the
messages from the $j$th checkpoint interval, $j \le i$, of $p_s$
are safe with respect to $p_r$ (Fig. 3).

Type 2: messages sent to another process on the same processor
are always safe because the recovery unit is the processor and so
the rollback of the sender of such messages automatically rolls
back the receiver.

Definition: The hazard index associated with each message m with
respect to the receiver p is a measure of the increased
probability of rollback propagation resulting from the immediate
processing of m by p.

By definition, the hazard indices of all safe messages are zero.
For each unsafe message, the exact hazard index might depend on
the communication between other processors, for example the
message $m_1$ in Fig. 4, and is in general impossible or
difficult to calculate. Since the purpose of our work is to
“reduce” rollback propagation instead of “eliminating” it, we are
interested in finding a feasible, low-overhead approximation to
the hazard index. In this paper, we approximate the hazard index
of an unsafe message by “the time to the next checkpoint of its
sender”. In particular, when the sender of a message m takes its
next checkpoint after m was sent, the hazard index of m reduces
to zero and m is treated as a safe message. The actual
implementation will be described later. Experimental results in
Section 4 support the use of such an approximation.

Our message scheduling algorithm is a prioritized scheme based
upon the hazard index, which encapsulates the run-time
communication and checkpointing information. The higher the
hazard index for a message, the lower the priority it is
assigned. Therefore, when a processor is ready to choose a new
message for processing, it will first choose one of the safe
messages if there is one; otherwise, the unsafe message with the
lowest hazard index will be chosen. With this scheduling
algorithm, the probability of rollback propagation is not
increased until it is necessary to keep the processors from
idling.
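The approximation and the resulting priority rule can be illustrated with a few lines of code. This is a sketch under simplifying assumptions that are not stated in the paper: sender and receiver are assumed to share a common clock, and `hazard_index` and `choose_next` are hypothetical helper names.

```python
def hazard_index(time_to_next_cp_at_send, sent_at, now, safe=False):
    """Approximate hazard index: remaining time until the sender's next checkpoint.

    Safe messages have hazard index zero by definition. An unsafe message
    becomes safe (index zero) once the sender's next checkpoint has passed.
    """
    if safe:
        return 0.0
    remaining = time_to_next_cp_at_send - (now - sent_at)
    return max(remaining, 0.0)


def choose_next(candidates, now):
    """Pick the message with the lowest hazard index (highest priority).

    candidates: list of (name, time_to_next_cp_at_send, sent_at, safe) tuples.
    """
    return min(candidates, key=lambda c: hazard_index(c[1], c[2], now, c[3]))[0]
```

With the situation of Fig. 2 in mind, a message whose sender checkpoints soon after sending is preferred over an older message whose sender is still far from its next checkpoint.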

3.3  Implementation 

In order for the receiver to maintain the hazard indices for the
received messages, an additional piece of information needs to be
piggybacked on each message: the time to the next checkpoint of
the sender when the message is sent. The receiver can then
properly manage its message queue based on this information.

Instead of keeping messages from different processors in the same
queue, each processor maintains an array of sub-queues, one for
each processor. Each message enters its corresponding sub-queue
if its hazard index is not zero and is added to the
highest-priority safe queue if it is a safe message. Three
additional data structures are needed for proper queue
management:

1. Last_Update_Time records the time at which the most recent
update of the time-to-next-checkpoint information was completed.
It is needed for the aging operation described later.

2. Last_Known_CP_Num[N] is an array, with one entry for each
processor, recording the most recent checkpoint interval number
of every processor that is known to the local processor based on
the communication history.

3. Last_Processed_CP_Num[N] is an array recording the highest
checkpoint interval number of the processed messages from each
processor. It is used for identifying Type-1 safe messages.

The updates of hazard indices and priorities take place only when
a new message arrives (enqueueing) or when the processor is about
to process the next message (dequeueing). The operations
performed on the message queue for enqueueing and dequeueing are
outlined in Fig. 5 and Fig. 6, respectively. The aging operation
updates the time-to-next-checkpoint information of the last
message in each non-empty unsafe sub-queue by the amount of the
difference between current time and Last_Update_Time. If the time
to the next checkpoint of a message becomes negative, all the
messages in the same sub-queue are moved to the safe queue.


/* message m from the ith checkpoint interval of p_s arrives
   at the message queue Q on p_r */

perform aging operation on Q;
if (i < Last_Known_CP_Num[p_s])
    add m to safe queue;
else {
    if (i > Last_Known_CP_Num[p_s])
        Last_Known_CP_Num[p_s] = i;
    if (i <= Last_Processed_CP_Num[p_s])
        add m to safe queue;
    else
        add m to unsafe sub-queue[p_s];
}

Figure 5: Operations for enqueueing.
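The enqueueing, dequeueing and aging operations of this section can be put together in a small runnable sketch. It is an illustration under stated assumptions, not the Chare Kernel implementation: the class and attribute names are hypothetical, time is passed in explicitly, messages from one sender are assumed to arrive in send order (so the last message in a sub-queue has the smallest hazard index), and a check for Type-2 (same-processor) messages is added at enqueue time.

```python
from collections import deque


class PrimpsQueue:
    """Message queue of one receiving processor (illustrative sketch)."""

    def __init__(self, n_processors, local_pid, now=0.0):
        self.local_pid = local_pid
        self.safe = deque()  # highest-priority queue of safe messages
        # One unsafe sub-queue per sender; entries are
        # [msg, checkpoint_interval, time_to_next_checkpoint].
        self.unsafe = [deque() for _ in range(n_processors)]
        self.last_update_time = now
        self.last_known_cp = [-1] * n_processors      # Last_Known_CP_Num
        self.last_processed_cp = [-1] * n_processors  # Last_Processed_CP_Num

    def _age(self, now):
        # Age only the last message of each non-empty unsafe sub-queue.
        elapsed = now - self.last_update_time
        self.last_update_time = now
        for sub in self.unsafe:
            if not sub:
                continue
            sub[-1][2] -= elapsed
            if sub[-1][2] < 0:
                # Sender has passed its next checkpoint: whole sub-queue is safe.
                self.safe.extend(entry[0] for entry in sub)
                sub.clear()

    def enqueue(self, msg, sender, interval, time_to_next_cp, now):
        self._age(now)
        if sender == self.local_pid:
            self.safe.append(msg)                     # Type 2: same processor
        elif interval < self.last_known_cp[sender]:
            self.safe.append(msg)                     # sender already checkpointed
        else:
            self.last_known_cp[sender] = interval
            if interval <= self.last_processed_cp[sender]:
                self.safe.append(msg)                 # Type 1
            else:
                self.unsafe[sender].append([msg, interval, time_to_next_cp])

    def dequeue(self, now):
        self._age(now)
        if self.safe:
            return self.safe.popleft()
        senders = [s for s, sub in enumerate(self.unsafe) if sub]
        if not senders:
            return None
        # Forced to take an unsafe message: pick the smallest hazard index.
        s = min(senders, key=lambda x: self.unsafe[x][-1][2])
        msg, interval, _ = self.unsafe[s].pop()
        # Remaining messages from the same sender are now Type-1 safe.
        self.safe.extend(entry[0] for entry in self.unsafe[s])
        self.unsafe[s].clear()
        self.last_processed_cp[s] = interval
        return msg
```

In a Fig. 2 style scenario, a receiver that enqueues `m0` from $p_0$ (10 time units from its next checkpoint) and then `m2` from $p_2$ (2 time units from its next checkpoint) will dequeue `m2` first once $p_2$'s checkpoint time has passed, whereas fifo scheduling would have processed `m0` first.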

4  Experimental  Evaluation 

Our message scheduling algorithm is implemented in the Chare
Kernel, which has been developed as a medium-grain,
machine-independent parallel language [14]. A program written in
the Chare Kernel language can run on both shared-memory and
distributed-memory machines such as Encore Multimax, Sequent
Symmetry, and the Intel iPSC/2 hypercube. Our experiments are on
an eight-processor Multimax 510.

A Chare Kernel program is structurally similar to a C program. It
contains a superset of the C language without global or static
variables. Two Chare Kernel calls resulting in communication are:

1. CreateChare() sends a message to request the creation of a
small process, called chare, and the processing of the enclosed
data by one of the subroutines, called entry codes, inside the
created chare.

2. SendMsg() sends a message to an existing chare and requests
the enclosed data to be processed by one of the entry codes.

There is no receive_message statement in a Chare Kernel program.
All messages are sent to the copy of the Kernel running on the
destination processor and managed by the Kernel according to the
scheduling algorithm selected by the user. Messages sent by
CreateChare() calls are kept in job queues and those sent by
SendMsg() calls enter the message queues. Each message queue has
a higher priority than the corresponding job queue. We apply our
scheduling algorithm, referred to as PRIoritized Message Process
Scheduling (PRIMPS), to both types of queues and force the Kernel
to dequeue a message from the job queue when there is no safe
message left on the message queue. In this section, we will
compare our PRIMPS algorithm with two alternatives supplied by
the Kernel: first-in-first-out (fifo) scheduling and
last-in-first-out (lifo) scheduling.

/* p_r is about to choose a message from queue Q */

perform aging operation on Q;
if (safe queue is non-empty)
    choose a message from safe queue;
else {
    choose the unsafe message m with the smallest hazard index;
    move the remaining messages in the same sub-queue to the
        safe queue;
    /* if m is from the ith checkpoint interval of p_s */
    Last_Processed_CP_Num[p_s] = i;
}

Figure 6: Operations for dequeueing.

The four programs used in this study are Matrix multiplication,
Circuit extraction, Knight tour and N queen. The execution and
checkpoint parameters for each program are listed in Table 1.
Each checkpoint interval is arbitrarily chosen to be
approximately 10 percent of the total execution time. Offsets
between corresponding checkpoints on different processors are
introduced to study the rollback propagation resulting from
asynchronous checkpointing. The $i$th checkpoint of $p_j$ is
taken at time $i \cdot T - j \cdot \Delta$, where $T$ is the
checkpoint interval and $\Delta$, the offset per processor, is
arbitrarily set to $T/10$. Our implementation of periodic
checkpointing utilizes the interrupt service routine for the UNIX
alarm() system call as the checkpointing routine. Each
checkpointing action is simulated by inserting a constant delay
(2 seconds). We assume a technique for detecting the messages
which do not require logging [11] is employed so that the
overhead for message logging is negligible.
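The checkpoint schedule, and the time-to-next-checkpoint value that a sender would piggyback on a message, can be worked out as below. This is a small illustration with hypothetical function names; it ignores the 2-second checkpointing delay and treats the schedule as exact.

```python
import math


def checkpoint_time(i, j, T, delta):
    """Time at which processor p_j takes its ith checkpoint: i*T - j*delta."""
    return i * T - j * delta


def time_to_next_checkpoint(now, j, T, delta):
    """Time from `now` until the next checkpoint of processor p_j.

    The next checkpoint is the smallest i with i*T - j*delta > now.
    """
    i = math.floor((now + j * delta) / T) + 1
    return checkpoint_time(i, j, T, delta) - now
```

For example, with T = 30 and an offset of 3 per processor (the parameters listed for three of the four programs), processor $p_2$ takes its first checkpoint at time 24.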

We define several rollback statistics associated with each
communication pattern which encapsulate various costs for
rollback recovery. They allow quantitative comparison between
different scheduling algorithms.


1. The average of the total number of rolled-back checkpoints
(rbcp) due to the rollback initiation of a processor.

2. The average of the total number of rolled-back processors
(rbpe) due to the rollback initiation of a processor.

3. The probability of rolling back at least one processor to some
checkpoint before the most recent checkpoint (ckpl).

Table 1: Execution and checkpoint parameters of the Chare Kernel programs.

Benchmark programs             Matrix           Circuit      Knight    N Queen
                               multiplication   extraction   tour
Number of processors           4                4            6         [illegible]
Execution time (sec)           290.07           252.51       280.15    1507.78
Checkpoint interval (sec)      30               30           30        150
Offset per processor Δ (sec)   3                3            3         15

Table 2: Rollback statistics.

Benchmark programs                Matrix           Circuit      Knight    N Queen
                                  multiplication   extraction   tour
rbcp                   lifo       5.42             1.25         30.50     30.61
                       fifo       2.54             1.15         31.72     32.19
                       PRIMPS     1.17             1.13         1.62      1.28
rbpe                   lifo       2.83             1.19         5.73      5.43
                       fifo       2.09             1.12         5.78      5.40
                       PRIMPS     1.13             1.10         1.53      1.23
ckpl                   lifo       0.537            0.045        0.906     0.841
                       fifo       0.224            0.019        0.906     0.833
                       PRIMPS     0.039            0.016        0.044     0.023
Execution time (sec)   lifo       290.07           252.51       280.15    1507.78
                       fifo       299.50           256.88       282.50    1635.18
                       PRIMPS     304.56           254.34       278.50    1526.92
Performance degradation           [illegible]

Communication traces were collected by intercepting the
CreateChare() and SendMsg() calls, and recording the time each
message was dequeued. Communication trace-driven simulation was
then performed on the traces to obtain the rollback statistics
shown in Table 2 and Fig. 7. All the reported numbers were
computed by averaging over five runs.
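As an illustration of how the three statistics could be computed from simulated rollbacks, consider the sketch below. The input format is an assumption made for this example: each trial describes one rollback initiation as a mapping from processor number to the number of checkpoints that processor rolls back, where restarting from the most recent checkpoint counts as one and processors that do not roll back are omitted.

```python
def rollback_statistics(trials):
    """Return (rbcp, rbpe, ckpl) for a list of simulated rollback initiations.

    trials: list of dicts {processor: rolled_back_checkpoints}, with a value
            of 1 meaning "restart from the most recent checkpoint".
    """
    n = len(trials)
    # rbcp: average total number of rolled-back checkpoints per initiation.
    rbcp = sum(sum(t.values()) for t in trials) / n
    # rbpe: average number of rolled-back processors per initiation.
    rbpe = sum(len(t) for t in trials) / n
    # ckpl: fraction of initiations in which some processor goes back
    # beyond its most recent checkpoint.
    ckpl = sum(any(c > 1 for c in t.values()) for t in trials) / n
    return rbcp, rbpe, ckpl
```

Under this counting, an initiation that rolls back only the faulty processor to its most recent checkpoint contributes exactly one to both rbcp and rbpe.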

The results show that the degree of rollback propagation varies
widely between different programs and different offsets. In all
cases, our PRIMPS algorithm reduces the cost of rollback recovery
and the probability of rollback propagation at the expense of
less than 5 percent performance degradation. Note that the
performance degradation reported here is measured against the
execution with lifo scheduling and with simulated checkpointing
overhead. Therefore, it represents the overhead required for
performing the prioritized scheduling above the normal overhead
for supporting checkpointing.
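The degradation relative to lifo scheduling can be recomputed from the execution times listed in Table 2. The short calculation below uses only those tabulated times; the dictionary layout and function name are illustrative, and the resulting percentages are derived here rather than quoted from the paper.

```python
# Execution times in seconds, as listed in Table 2.
EXEC_TIME = {
    "Matrix multiplication": {"lifo": 290.07, "fifo": 299.50, "PRIMPS": 304.56},
    "Circuit extraction": {"lifo": 252.51, "fifo": 256.88, "PRIMPS": 254.34},
    "Knight tour": {"lifo": 280.15, "fifo": 282.50, "PRIMPS": 278.50},
    "N Queen": {"lifo": 1507.78, "fifo": 1635.18, "PRIMPS": 1526.92},
}


def degradation(program, algorithm, baseline="lifo"):
    """Relative slowdown of `algorithm` with respect to `baseline`."""
    times = EXEC_TIME[program]
    return (times[algorithm] - times[baseline]) / times[baseline]


if __name__ == "__main__":
    for name in EXEC_TIME:
        print(f"{name}: {100 * degradation(name, 'PRIMPS'):.2f}%")
```

For every program the PRIMPS value stays below 5 percent, and for Knight tour it is slightly negative, meaning the PRIMPS run was marginally faster than the lifo run.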

One potential disadvantage of independent checkpointing schemes
is that slower recovery due to possible rollback propagation may
make them unsuitable for real-time applications. Table 2 shows
that the rollback cost (rbcp) and restarting cost (rbpe) for the
PRIMPS algorithm are not only greatly reduced but also reduced to
numbers less than two. Since the minimum value for both rbcp and
rbpe is one (the rollback of the faulty processor), the
statistics in Table 2 imply that rollback recovery for
independent checkpointing can be made faster by using the PRIMPS
algorithm.

Figure 7: Comparison of rollback statistics as a function of the
offset per processor Δ (in seconds) for the N Queen program.

The small percentage (less than 5%) of potential rollbacks
contributing to the non-zero ckpl values for the PRIMPS algorithm
occurs mostly at the starting phase or the ending phase of the
program execution, where the number of messages is small.
Although the checkpoints are not explicitly synchronized, this
result shows that the set of most recent checkpoints still forms
a consistent recovery line with high probability (higher than
0.95) for the PRIMPS algorithm.

Fig. 7 shows that the degree of rollback propagation becomes
worse as the offset Δ increases for all three algorithms.
However, it is less sensitive to the offset for the PRIMPS
algorithm. Therefore, our approach is particularly attractive for
applications where such offsets may exist and the synchronization
cost for always maintaining a consistent set of checkpoints is
prohibitively high.


5  Limitations 

Although the actual implementation of logging and checkpointing
will have little impact on our comparison of rollback statistics,
it does affect the performance degradation. For some
applications, the PRIMPS algorithm might tend to start more jobs
than is necessary to keep the system busy in order to delay the
processing of unsafe messages. Therefore, the memory usage and
the overhead for checkpointing and message logging may be higher.

Sometimes priorities are attached to the messages in order to
ensure correctness or achieve better load balancing. Since our
message scheduling algorithm cannot schedule messages across
different priority groups, there are fewer messages available for
scheduling in each group and the algorithm will be less
effective.


6  Concluding  Remarks 

In this paper, we have exploited the property that reordering the
message processing can change the communication pattern to reduce
the occurrence of rollback propagation without affecting program
correctness. The concepts of safe messages and hazard indices are
introduced as the basis for our run-time prioritized scheduling
algorithm. Experimental results based on the communication
trace-driven simulation for several parallel programs show that
the probability of rollback propagation can be greatly reduced at
the cost of reasonable performance degradation.